## Lab Session – Compiler Options and Code Optimizations

## Objectives.

## To run programs using different compiler options and analyze their performance

## To reduce the algorithmic complexity of programs using code optimizations

## To use the roofline performance model in order to identify performance issues

## To Improve cache utilization and performance of programs using code optimizations

## Aim

The **aim** of this lab session is to acquaint yourselves with code optimizations, cache efficiency and performance issues.

## Leveraging on the compiler’s options to improve performance

**Task1**: Download the ‘*Image\_processing*’ file. This is the main kernel of an image processing application. This routine has been written by an inexperienced developer and thus there is much room for improvement. Compile the code by using the following options and measure the execution time in each case. Keep in mind that this option is available in Visual Studio too.

* + -Ofast (optimize very aggressively to the point of breaking standard compliance)
  + ‘-O3’ option
  + ‘-O2’ option
  + ‘-O1’ option
  + ‘-O0’ option (this is the default option)

Gcc provides about 200 optimizations, which can be found in <https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html> . Keep in mind that not all the optimizations aim on reducing the program’s execution time, e.g., some optimizations relate to debugging and security.

As you might have seen in the link above (<https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html>), ‘-O3’ includes all the optimizations included in ‘-O2’ plus some additional ones. However, ‘-O3’ might generate slower code than ‘-O2’. A specific optimization might improve the performance of a specific code block (just for a specific input size) or might degrade performance. Furthermore, the order that the optimizations are applied might strongly impact performance. ‘-O3’ option applies a number of optimizations in a fixed order, but we could manually apply those in a different order and further improve performance. The problem of identifying which optimizations to apply, their parameters as well as their order, remains unsolved. So, using the right optimizations is a matter of experience and expertise.

**Task2**: Compile the ‘*Image\_processing*’ file using ‘-O2’ and any other optimization you like. Find experimentally a good optimization set. An example is shown below:

*gcc image\_processing.c -o p -O2 -funroll-loops -floop-unroll-and-jam*

## Reducing the algorithmic complexity (optional)

**Task3 (optional):** ‘*inefficient\_routine()*’ is a routine written in an inefficient way. Try to simplify the code (reduce its complexity), firstly, by eliminating redundant operations and secondly, by applying as many optimization techniques you have been taught so far (have a look at the lectures).

Some of the optimizations you can apply are:

* the loop kernel in line 64 can be simplified
* the if condition in line 69 can be eliminated
* gauss\_x\_compute[][][] and gauss\_xy\_compute[][][] arrays are redundant and thus can be eliminated
* function inline

Improving Cache Utilization and Performance

## Case study. Matrix-Matrix Multiplication (MMM)

## **Task1**. Download and study ‘mmm.c’ file. Revisit the cache analysis we had for MMM previous week as well as the roofline model. ‘mmm.c’ is a simpler version of the 2mmm.c program we studied last week. ‘mmm.c’ contains just one MMM loop kernel, while 2mmm.c contains two MMM loop kernels. Fig.1 shows the dL1 statistics of 2mmm() program; the statistics of ‘mmm.c’ program are included in Fig.1, so there is no need for profiling the ‘mmm.c’ program again.

## *Fig.1. Valgrind Output for 2mmm.c file*

## ***Roofline Model***: The roofline model [1] provides an easy way to get performance bounds for compute-bound and memory-bound loop kernels. It allows us to know how far the achieved performance is from the optimum. It is based on the concept of computational intensity, sometimes also called arithmetic or operational Intensity. The arithmetic intensity is given by the following formula: ‘*FP.arithmetical.instructions / number.of.bytes.loaded.stored*’. This model has several limitations [1], e.g., does not consider all features of modern processors and ignores integer computations.

## ***Roofline Model – MMM algorithm***: MMM has N3 iterations and each iteration contains 4 Floating Point (FP) L/S operations and 2 FP arithmetical operations. So, the arithmetical intensity of MMM (the ratio between FP arithmetical operations and number of bytes loaded/stored), when the arrays are of type ‘float’, is 2/(4\*4bytes)=1/8.

## 

## *Fig.2. The Roofline Model [1]*

## Algorithms that have a low arithmetical intensity are **memory-bound,** while algorithms that have a high arithmetical intensity are **compute-bound**. Memory-bound means that their performance is bounded on the memory latency and bandwidth values; in simple words, no matter how fast the CPU is, it stays idle waiting for the data to be loaded/stored from/to memory.

*Attainable GFLOPS = min(Peak Floating Point Performance, Peak Memory Bandwidth x Arithmetical Intensity)*

## The peak FP performance is the maximum we can get, and is CPU dependent. The peak memory bandwidth depends on the DDR and memory controller hardware characteristics. Furthermore, if the data fit in a cache memory and they are always accessed from there, the peak memory bandwidth is the cache bandwidth. So, if the peak memory bandwidth is 21GBytes/sec, then the maximum MMM performance will be 21GB/sec x (1/8)=2.65gigaflops. If the arrays fit in the precious cache memories, then the memory bandwidth is higher and thus performance is increased.

## ***How can we improve the performance of memory-bound algorithms?*** The main strategies are as follows. Reducing the number of memory accesses through the whole memory hierarchy; this relates to reducing the number of cache-misses too. Another strategy is to use software prefetching. The above can be achieved by using code optimizations such as loop tiling, register blocking, array copying, loop merge/distribution etc.

## ***Task2***. Measure the GFLOPS achieved on your CPU for the mmm() routine. Use different input sizes. Use the ‘-O2’ compiler option. As the input size increases, the MMM performance is reduced. This is because the arrays can no longer fit either in the dL1 or L2 of L3 cache.

## **Task3.** Repeat Task2 using ‘-O3’ compiler option. The performance achieved is higher as auto-vectorization is applied as well as loop optimizations such as loop tiling.

## **Task4**. Run ‘flops.c’ program; you will see printed the number of flops achieved. Why GFLOPS value of this program is that high, comparing to MMM? Flops() routine is written using AVX x86-64 intrinsics. If you remember you have seen those in soft261. Although these intrinsics will be further explained next week, in this module, we will not be focusing on SSE/AVX programming, so remain calm. Flops() routine is just a code example I wrote to get very high CPU performance.

## I have applied the procedure of task2, task3 and task4 on my PC and the results are shown in Fig.2. ‘flops.c’ program has an arithmetic intensity value of 2, while MMM a value of 1/8. This means that flops() routine is compute-bound while MMM is memory-bound. Furthermore, flops() routine has an excellent balance between add and multiply instructions, allowing us to use fused add/multiply instructions. All the fmadd instructions feed the pipeline one after another maximizing performance. To sum up, flops() routine gets the most out of our CPU performance (81 GigaFLOPS) while MMM is far away from it (we can improve its performance thought).

## *Fig.3. Results from Task2, Task3 and Task4*

Register Blocking as a solution to memory-bound algorithms - Reduce the number of memory accesses

## **Task5**: Apply register blocking transformation to mmm() and check the a) the performance gain (using the timer), b) the total number of memory accesses (‘Dr’ and ‘Dw’ columns), c) the gain in FLOPS. Apply register blocking only to j loop with a factor of 4, 8 and 12. Use ‘-O2’ option. Compare the results you found with the results prior to register blocking. Pay attention to the ‘Dr’ (data reads) column. It is important to note that the register blocking factor must perfectly divide N. Make sure your code generates the right output before you use valgrind. To this end, you could perhaps write another function that compares the results.

## *Keep in mind that the above application of register blocking is not the appropriate as normal (short) registers are used and not the wide ones (used for vectorization). We will discuss more about vectorization next week. Proper register blocking must be applied on vectorized code. By using ‘-O3’ option, auto-vectorization is enabled and by applying register blocking as above, the compiler cannot vectorize the code properly. This is why we use ‘-O2’ option.*

## Below the code is shown with an unroll factor of 4

*for (i=0; i<N; i++)*

*for (j=0; j<N; j+=4){*

*c0=C[i][j];*

*c1=C[i][j+1];*

*c2=C[i][j+2];*

*c3=C[i][j+3];*

*for (k=0; k<N; k++) {*

*a=A[i][k];*

*c0+=a\*B[k][j];*

*c1+=a\*B[k][j+1];*

*c2+=a\*B[k][j+2];*

*c3+=a\*B[k][j+3];*

*}*

*C[i][j]=c0;*

*C[i][j+1]=c1;*

*C[i][j+2]=c2;*

*C[i][j+3]=c3;*

*}*

Array copying as a solution - Reduce the number of memory accesses:

## We have explained in the previous lab session why accessing B[][] array in a column-wise order is not efficient; if you are still not sure about the reason, revisit the notes of the previous lab session. We can solve this problem by applying array copying transformation. The new code will be the following

## *for (i=0;i<N;i++)*

## *for (j=0;j<N;j++)*

## *Btranspose[i][j]=B[j][i];*

## *for (i=0;i<N;i++)*

## *for (j=0;j<N;j++)*

## *for (k=0;k<N;k++)*

## *C[i][j]+=A[i][k]\*Btranspose[j][k];*

## An extra loop kernel is added copying the columns of B[][] in Btranspose[][]. This loop kernel adds extra overhead to the program but as you will find out by measuring its execution time, it pays off. In the new version of the program a row of A is multiplied by many rows of Btranspose (not columns).

## **Task6**: Apply the array copying optimization and compare this code with mmm() in terms of a) execution time, b) dL1 accesses and misses, c) FLOPS. Compare the results.

Loop tiling as a solution - Reduce the number of memory accesses:

## Run the ‘*mmm\_tiling\_bad()*’ routine and measure the performance for different tile sizes. The tile size must perfectly divide N. Which tile size is the most efficient? It is important to note that in this case loop tiling has not been applied in an efficient way and there is much room for improvement; the efficient application of loop tiling is an advanced topic and will not be discussed here. For those who want to learn more, you can read the following article [3]; however, this is out of the scope of this lab session. Furthermore, loop tiling can be applied together with array copying and register blocking and thus highly improve performance.

## In mmm\_tiling\_bad() routine, loop tiling has been applied to all the three loops. The tiles are square of size TILExTILE. The main reason that the application of loop tiling is not efficient here is because the tiles’ elements are not written in consecutive main memory locations (all the tiles’ sub-rows are stored into no consecutive memory locations). Thus, although the aggregated tiles size is smaller than the cache, the tiles sub-rows conflict with each other and they cannot remain in the cache. Still, even using loop tiling in an inefficient way, performance is improved as some of the sub-rows remain in the cache.

Software prefetching as a solution:

This paragraph is for your information only as it refers to more advanced topics. Next week, we will learn how to use SSE/AVX x86-64 intrinsics. These include prefetch instructions. All the prefetch instructions supported for x86-64 architectures can be found here <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=173,5533,3505,1449,3505,2940,2024&text=prefetch>. An example of a software prefetch instruction is shown below

*\_mm\_prefetch(&C[i][j], \_MM\_HINT\_T0);*

The instruction above pre-fetches the cache line containing C[i][j] from DDR. No value is written back to a register and we do not have to wait for the instruction to complete. The cache line is loaded in the background.

## References and Further Reading:

1. Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (April 2009), 65-76. DOI=10.1145/1498765.1498785, [available](http://doi.acm.org/10.1145/1498765.1498785) at <https://people.eecs.berkeley.edu/~kubitron/cs252/handouts/papers/RooflineVyNoYellow.pdf>
2. Options That Control Optimization, available at <https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html>
3. V.I. Kelefouras, A. Kritikakou, I. Mporas and V. Kolonias, “A high performance Matrix-Matrix Multiplication methodology for CPU and GPU architectures”, Journal of Supercomputing, Springer, available at <https://link.springer.com/article/10.1007/s11227-015-1613-7>
4. Optimizing compilers for modern architectures: a dependence-based approach, book, available at <https://liveplymouthac-my.sharepoint.com/:b:/g/personal/vasilios_kelefouras_plymouth_ac_uk/EVy4Laj_1W9Hr7D3W57CBuQBeohd0M9iVVT7x5n91PcDyg?e=RGnRCa>